To: Art Unit 2186, Bataille, Pierre Michel    Page 19 of 94
2004-10-29 00:55:02 (GMT)
16507451098 From: Shahram Abdollahi

Application Number 10/017,676
Amnt A contd.
16 of 22



REMARKS:

General

Applicants have rewritten all the claims to define the invention more particularly and distinctly
so as to overcome the technical rejections and define the invention patentably over the prior art.
All these claims are based on what has been shown in both the specification of record and the
substitute specification.

Independent claim 10 describes a novel apparatus for routing address lookup, which incorporates
an associative memory array together with a novel deterministic address entries categorization
method implemented through the applicants' new content comparable memory (CCM) device.
Any particular memory bank of the content addressable memory banks is activated only if the
routing address is deterministically known to be in that memory bank, which saves the power
consumed by the apparatus. The results achieved are new, unexpected and have never been
suggested before.

Dependent claim 11 adds the implementation of the categorization of the address entries based
on prefix lengths. In addition to the benefits realized through claim 10, this removes the need to
use ternary CAM arrays and also priority encoders inside each CAM bank, which results in
smaller apparatus sizes and higher operation speeds.

Dependent claim 12 recites one embodiment of realizing the prefix length categorization of the
address entries in hardware.

Dependent claim 13 extends the implementation recited in claim 10 up one level, from memory
bank level to the monolithic integrated device level, with the same benefits and advantages.

Dependent claims 14-15 recite the same matter as claims 11 and 12, this time with the additional
matter of claim 13.

Independent claim 16 recites a novel apparatus for comparing two binary words in magnitude,
which is named a Content Comparing Memory (CCM) by the applicants. The implementation
results in a compact device which is small enough to be called a memory cell. It is not much
larger than a binary CAM cell and is about the same size as a ternary CAM cell. The
compactness of the cell allowed the inventors to use the cell extensively in the overall invention.
The resulting cell is new, unexpected and has never been suggested before.

Dependent claims 17-27 add various permutations of the following limitations to the independent
claim 16: - the memory cell storing both the bit and its inverse, - having an inverter to create the
logical inverse of the stored bit, - delivering the inverse of the search binary bit to the cell, and
- the logic devices inside the memory cells being implemented through transmission gates.

Independent claim 28 recites a method for routing address lookup, which incorporates a novel
deterministic address entries categorization method. Any particular memory bank of a device
which uses this method will be activated only if the search routing address is deterministically
known to be in that memory bank, which saves the power consumed for the routing address
lookup. As with claim 10, the results achieved are new, unexpected and have never been
suggested before.

Dependent claim 29 extends the categorization method recited in claim 28 up one level, from
memory bank level to the monolithic integrated device level, with the same benefits and
advantages.

Independent claim 30 adds the method of categorization of the address entries based on prefix
lengths. In addition to the benefits realized through claims 28 and 29, this eliminates the need to
use priority encoders inside memory banks to resolve multiple prefix length matches, resulting in
higher operation speeds.

Independent claim 31 recites essentially the same method for routing address lookup as claim 28,
but it extends the method to any hierarchy of storage and search elements in an address lookup
search engine.

Independent claims 32-34 recite various methods to compare two binary words in magnitude
using the CCM apparatus recited in claims 16-27. Due to the novelty of the CCM cell, the methods
for using it are also novel and therefore patentable.

The Objection to the Disclosure Due to Informalities (items 1&2 in the office
action)

The claims have been rewritten, with care taken to avoid informalities. The specification has also
been substituted with a more appropriate format.

The Rejection of Claims 1-9 under 35 U.S.C. §102(a) as Being Anticipated by
Miyatake et al. (item 4 in the office action)

In their paper, Miyatake et al. show a CAM cell with PMOS match-line drivers instead of the
usual NMOS drivers. This choice results in a lower swing on the matchlines [...]
power consumption. It also results in certain benefits in the precharge levels of the bitlines and
matchlines. Miyatake et al. also show a match sense amplifier with an NMOS passgate between
its input and the matchline to further reduce the [...]. [...]
also makes sure that the matchline PMOS switches do not leak in the off state, as their gate is
driven to vdd-vtn, rather than full vdd.

This application's new claims 10-15 and 28-31 show novel apparatus and methods thereof for
routing address lookup, based on what has been shown in both the specification of record and
the substitute specification. Claims 28-31 recite methods for deterministically activating only
certain portions of an associative memory based lookup search engine during a lookup operation.
Claims 10-15 recite the architecture to implement the methods in claims 28-31 in hardware.
What is recited in these claims is not shown by Miyatake et al.

New claims 16-27 and 32-34 show novel apparatus and methods thereof for a compact content
comparing memory (CCM). Using this memory cell and the method described in the claims, a
search entry can be compared in magnitude with the contents of the memory. This memory
accomplishes this task while its size remains small such that it can be easily integrated on a
monolithic integrated device. This device is used in the implementation of the apparatus and
methods thereof recited in claims 10-15 and 28-31. Again there is no mention of such a device in
the paper by Miyatake et al.

Parts of dependent claims 11, 12, 14, 15 and independent claim 30 recite categorizing the address
entries based on prefix lengths. Although Miyatake et al. mention this in their paper, they do not
combine it with other novel features cited in the above-mentioned claims. Therefore, these
claims recite a novel combination and are patentable.

Based on the observations above, applicants submit that the new claims 10-34 clearly recite
novel subject matter which distinguishes over what is anticipated by Miyatake et al.

The Rejection of Claims 1-3 under 35 U.S.C. §102(e) as Being Anticipated by US
6,665,297 (Hariguchi et al.) (item 5 in the office action)

In their patent, Hariguchi et al. show a new network routing table architecture which has the
following features: 1- It uses hash circuits in parallel with content addressable memory (CAM)
arrays. 2- It divides address space based on prefix lengths among the hash circuits [...]
arrays. 3- It uses a new method for hashing operation.

They do not show the novel apparatus and methods thereof for routing address lookup recited in
claims 10-15 and 28-31. They also do not show anything which resembles the CCM apparatus
and method thereof recited in claims 16-27 and 32-34.

Hariguchi et al. show the categorization of the address entries based on prefix lengths. As
explained in the Miyatake et al. case above, this categorization is recited in dependent claims 11,
12, 14, 15 and independent claim 30 together, but it is in combination with other novel features
not shown by Hariguchi et al. Therefore, these claims recite a novel combination and are
patentable.

Hariguchi et al. also show the use of mask memory arrays for the implementation of the above-
mentioned categorization. Dependent claims 12 and 15 recite the utilization of a mask memory
array, but again they do so in combination with other novel features not shown by Hariguchi et
al. Therefore, these claims recite a novel combination and are patentable.

Based on the observations above, applicants submit that the new claims 10-34 clearly recite
novel subject matter which distinguishes over what is anticipated by Hariguchi et al.

Other Prior Art Made of Record and Not Relied Upon in This Office Action

Other references cited in the office action (paper by McAuley et al. and US patent by Muller et
al.) have been reviewed by the applicants.

McAuley et al. discuss various usages of binary and ternary CAM arrays in address lookup. In
some of the methods, they show the categorization of the address entries based on prefix lengths.

Muller et al. discuss a general architecture for a network switch with: 1- a packet
header processing unit, 2- a search engine using a regular CAM array. They recite using mask
bits in one implementation of the CAM search engine.

None of these references anticipate applicants' invention or render it obvious.

CONCLUSION

For all the above reasons, applicants respectfully submit that the specification and claims are
now in proper form, and that the claims all define patentably over the prior art. Therefore, they
submit that this application is now in full condition for allowance, which action they respectfully
solicit.

Conditional Request for Constructive Assistance 

Applicants have amended the specification and claims of this application so that they are proper,
definite, and define novel structure and method which are also unobvious. If, for any reason, this
application is not believed to be in full condition for allowance, applicants respectfully request
the constructive assistance and suggestions of the Examiner pursuant to M.P.E.P. § 2173.02 and
§ 707.07(j) in order that the undersigned can place this application in allowable condition as soon
as possible.

Very Respectfully,

Shahram Abdollahi-Alibeik    Mayur Joshi

PO Box 19389
Stanford, CA 94309
Tel: 650-575-6690; Fax: 650-745-1098

Enclosed:

A substitute specification in clean form without markings as to amended material.

A marked-up copy of the substitute specification showing the matter being added to and the
matter being deleted from the specification of record.

New sheets 1/6 to 6/6 of drawings.

Certificate of Facsimile Transmission. I certify that on the date [...]
communication, and attachments if any, to GAU 2186 of the U.S. Patent and Trademark Office
at 703-872-9306.

Date: [illegible]    Inventor's Signature: [illegible]

Marked-Up Copy of the Substitute Specification

NOTES:

1. Added materials are in italics.

2. Deleted materials are in strikethrough.

3. Materials from the original specification are in normal fonts.

4. All figures are moved to the drawings section, enclosed separately. Deleted figures are
marked [FIGURE DROPPED] in their caption.

22883 204.1001.02
PATENT TRADEMARK OFFICE

This application is submitted in the name of the following inventor(s):

Inventor                     Citizenship   Residence City and State
Shahram ABDOLLAHI-ALIBEIK    Iran          Palo Alto, California
Mayur Vinood JOSHI           India         Palo Alto, California

TITLE OF THE INVENTION

High-Speed Low-Power CAM-Based Search Engine

BACKGROUND OF THE INVENTION

Field of the Invention

This invention relates generally to search engines for routers.

Related Art

The Internet was created for the purpose of sharing information between
research universities and government agencies. However, with greater commercial and
individual use, Internet traffic has grown exponentially, leading to increased problems
associated with routing messages. Messages are routed (or switched) by routing or

EL 734815352 US

switching devices, herein called "routers." Routers serve as the principal devices by
which messages are received and forwarded along communication paths. Routers
preferably select substantially optimal paths for messages in response to factors related to the
condition of the paths and the status of the network.

One known problem is that router speed is often not sufficient to keep up
with message traffic. Messages are sent using packets, relatively smaller blocks of data
that are received by a router (or switch) and forwarded on to their destination(s). Packets
include at least two parts: (1) a header with the address of the source and destination
computers, and (2) data to be sent with the packet. For example, upon sending an email
message to a recipient, the message can be broken into multiple packets for travel on
different communication paths within the Internet to the ultimate destination.

In preferred embodiments, routers use known protocols including TCP
(transmission control protocol) and IP (internet protocol). As known in the art of routing
messages, TCP breaks messages into data packets and reassembles them in order once they
reach their ultimate destination. As known in the art of routing messages, IP assigns each
packet a unique source IP address (and at least one IP destination address). Each router
reads the destination IP addresses and determines which path is the best one for the
packet to take to its ultimate destination. In preferred embodiments, the router makes this
determination in response to a routing table of routing treatments for each destination IP
address.

The routing table includes a set of routing treatments associated with each
destination IP address (and possibly also responsive to the source IP address in the case of
multicast packets). The routing table can be constructed by the router in response to
information about the network gleaned by the router, received by the router from other
routers, and in response to other factors.

In the presence of relatively heavy network traffic, the router can become
overloaded; that is, packets arrive at the router faster than the router can process them. In
such cases, excess packets are queued for their particular input interface at the router, and
remain queued until processed. If the queue overflows, the router can be forced to refuse
to process one or more packets, causing those packets to not reach their destinations.
Although other message sending protocols generally have retry techniques for response to
this problem, message retry can be substantially time consuming and result in
performance degradation at the router.

One known technique to improving router speed includes using a CAM
(content addressable memory) to speed operation of the router in accessing its routing
table. Although this technique has the advantage of improving router speed, it is subject to
several drawbacks. First, CAM devices have relatively high power requirements, thus
requiring expensive packaging. Second, CAM devices have relatively limited amounts of
memory that can be stored on a single chip. Third, CAM devices are relatively difficult to

coordinate among multiple chips, with the result that CAM size can become a limiting
factor for router operation.

Accordingly, it would be advantageous to provide a technique for CAM
lookup in a router that is not subject to drawbacks of the known art.

SUMMARY OF THE INVENTION

[...] CAM (content addressable memory) based ASIC (application specific integrated circuit)
to allow lookups to keep up with message speeds over optical fibers.

In a preferred embodiment, the routing table includes at least one, and
possibly many, chips with CAM memory banks. Each chip contains entries from a selected
range of the address space, and within each chip those entries are further divided into
several banks. Preferably, each bank contains entries of identical prefix length.
Depending on the number of entries for each prefix length on each chip, several banks may
be required to store entries for a single prefix length. Within those several banks, each
bank includes entries from a selected address range. Thus, each address lookup need only
activate one bank per prefix length in order to obtain a lookup match.
A Content Comparable Memory (CCM) is contained within each CAM
bank; this CCM stores and compares the least possible address that will match the entries
in the table with the incoming address. If the incoming address is found to be greater than or
equal to the data stored in the CCM but less than the data in the next bank's CCM which
contains addresses of the same prefix length, the incoming address will be directed to the
rest of the CAM bank for comparison.
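
The bank-selection rule in the preceding paragraph can be sketched in software. This is a minimal illustration only, assuming that each bank of a given prefix length is summarized by the least address it can match (the value its CCM would hold) and that these bounds are kept sorted. The hardware compares all CCMs in parallel; the binary search below is a software stand-in for that comparison. The function name, the sample bounds, and the treatment of the last bank as open-ended are assumptions for illustration, not taken from the specification.

```python
from bisect import bisect_right
from ipaddress import IPv4Address


def select_bank(lower_bounds, address):
    """Return the index of the one bank to activate, or None.

    lower_bounds: sorted list of the least address each bank can match
    (hypothetical CCM contents), all for a single prefix length.
    A bank is chosen when lower_bounds[i] <= address < lower_bounds[i + 1];
    the last bank is treated as having no upper bound (an assumption).
    """
    index = bisect_right(lower_bounds, address) - 1
    return index if index >= 0 else None


# Hypothetical CCM values for three banks holding entries of one prefix length.
bounds = [int(IPv4Address(a)) for a in ("10.0.0.0", "64.0.0.0", "171.0.0.0")]

print(select_bank(bounds, int(IPv4Address("171.54.32.21"))))
print(select_bank(bounds, int(IPv4Address("9.255.255.255"))))
```

Only the selected bank would then be searched; an address below every bound activates no bank at all.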

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In this application, a preferred embodiment of the invention is described
with regard to process steps and data structures. Those skilled in the art would recognize,
after perusal of this application, that embodiments of the invention can be implemented
using circuitry or other structures adapted to particular process steps and data structures,
and that implementation of the process steps and data structures described herein would
not require undue experimentation or further invention.

TECHNICAL APPENDIX

The enclosed technical appendix includes further information and detail
regarding the invention and a preferred embodiment thereof. The enclosed technical
appendix is an integral part of this application, and is hereby incorporated by reference as if
it were fully set forth herein.

Alternative Embodiments

Although preferred embodiments are disclosed herein, many variations are
possible which remain in the concept, scope and spirit of the invention, and these
variations would be clear to those skilled in the art after perusal of this application.

ABSTRACT OF THE DISCLOSURE

The invention provides a method and system for integrating a CAM-based
ASIC that will allow lookups to keep up with transmission speeds over optical fibers.
The lookup table includes chips with CAM banks. Each chip contains entries from only a
certain range of the address space; within each chip the entries are divided into several
banks. Each bank contains entries of the same prefix length. Depending on the number of
entries in each prefix length on each chip, several banks may be required to store these
entries. Each bank contains entries contained in a particular address range. Each address
lookup will activate one bank per prefix length in order to get a match. A Content
Comparable Memory (CCM) is contained within each CAM bank; this CCM stores and
compares the least possible address that will match the entries in the table with the incoming
address. If the incoming address is found to be greater than or equal to the data stored in the
CCM but less than the data in the next bank's CCM which contains addresses of the same
prefix length, the incoming address will be directed to the rest of the CAM bank for
comparison.

1st Level: Inter-Chip on Address Space
2nd Level: Intra-Chip on Prefix Length
3rd Level: Intra-Chip on Address Space

Figure 1: Conceptual overview of the architecture for the lookup method



1.0 Overview

One of the limiting factors in future high-speed IP routers is the problem of performing the
longest prefix match to decide out of which port the incoming packet should be routed. The
longest prefix match involves looking at the destination address on the incoming packet and
finding the longest prefix in the routing table that matches it. As the rate at which data packets
can be transmitted over the optical fibers is increasing, the number of lookups that need to be
done to keep up with this speed is also increasing rapidly. Here we propose a hardware solution in
the form of a CAM (Content Addressable Memory) based ASIC (Application Specific Integrated
Circuit) that allows lookups to keep up with transmission speeds over optical fibers for the next
several years. The advantages of this method are:

Discussion of Prior Art

One of the most important applications for this invention is to perform address lookups in
routers. A router is an electronic device that has several input and output ports. It receives data
packets with destination addresses on each of these ports. The lookup function involves looking
up these destination addresses in a table called a lookup table to determine which output port
this particular data packet should be sent to, so that it gets to its destination most efficiently.
The router then sends the packet out of the output port determined by this lookup.
One method of doing this is to build a list of all the possible destinations and the best port to
send a packet out of to reach each destination. Then each destination address can be searched in
this table and the best outgoing port can be determined. However, in large networks like the
internet the number of such destinations is so large that such a table becomes too large to be
implemented in each router. Another consideration is the maintenance of this table. Each
time a new destination is added to the network, each router in the network has to be informed of
this. This is very cumbersome for large networks.

Hence, to solve this problem, a prefix based lookup scheme is used to carry out routing in
modern internet routers. The idea here is that the network is arranged in a hierarchical fashion
and the addresses are allocated accordingly, somewhat similar to a postal address. For example,
take an imaginary postal address like 123, Some Street, Someville, Somestate, US. The zip code
has been dropped to make the postal example more analogous. Thus, a letter with this address
posted from anywhere in the world would first be sent to the US. Next, the US postal system will
direct the letter to Somestate, from where it will go to the city Someville and so on and so forth.
Thus, this system eliminates the requirement for every post office in the world to have knowledge
of where 123, Some Street is and how to deliver a letter to that address. Similarly, prefixes allow
the aggregation of entire sub-networks under one entry in the lookup table.
However, there are special cases that need to be taken care of. Again falling back on the postal
system analogy, from some places in Canada it is more efficient to send a letter to Alaska
directly rather than sending it first to the mainland US postal system. Thus, these Canadian
postal offices would have a letter routing rule book that has two entries: send letters addressed
to US to the mainland US postal system; send letters addressed to Alaska, US to the Alaska postal
system. Here clearly the second rule has higher precedence over the first one. This can be
expressed as longest prefix matching. Thus, one should use the rule with the longest or most
specific prefix for routing decisions. Similarly, in routers the longer prefix has a higher priority
than a shorter prefix. This is the basic concept behind CIDR (Classless Inter-Domain Routing)
which is used in routers.
Even though this concept cuts down on the number of entries that need to be maintained in the
routing table, the number of entries in the routing tables of routers in the backbone
of the internet is still large, at around 100,000 today. To provide adequate margin for growth
during the lifetime of these routers, routers are currently shipped with the ability to support one
million entries. Today these addresses are 32 bits long (under a scheme called IPv4) but as the
stock of available addresses is depleted, 128 bit long addresses (IPv6) are coming into use.
Another factor that is making this task difficult is that the speed of the links connecting these
routers is growing with rapid advances in technology. The state of the art optical fiber links
today can run at 10 Gbps (called OC-192). Considering that minimum sized (40 bytes) data
packets are sent over links of this capacity, a lookup speed of slightly over 30 million lookups per
second is required. Systems currently in development will support link speeds of 40 Gbps (OC-
768), requiring a lookup speed of over 120 million lookups per second. This lookup speed is
required for each link to a router. A router may have several links connected to it. Thus, overall
the problem is to search for the longest prefix match for each address among a million prefixes
at the speed of several hundred million lookups per second. Using just prior art this is a daunting
problem. The parameters of interest are power consumption, number of chips required to store
and search the table and the chip area of these chips, latency of search, and the rate at which the
search can be performed.
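
The lookup rates quoted above follow from dividing the link rate by the size of a minimum-sized packet. The short sketch below reproduces that arithmetic using the round figures given in the text (10 Gbps and 40 Gbps links, 40-byte packets, one lookup per packet). The function name is an assumption for illustration, and framing overhead is ignored.

```python
def lookups_per_second(link_bits_per_second, packet_bytes=40):
    """Worst-case lookup rate: one lookup per minimum-sized packet."""
    return link_bits_per_second / (packet_bytes * 8)


for name, rate in (("OC-192", 10e9), ("OC-768", 40e9)):
    millions = lookups_per_second(rate) / 1e6
    print(f"{name}: {millions} million lookups per second")
```

A router terminating several such links needs the sum of the per-link rates, which is how the requirement reaches several hundred million lookups per second.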

An example of a lookup table used for forwarding is shown in FIG. 1. Each entry in this table is
32 bits wide. The first column contains the prefix with the prefix length after the '/'. Each 32 bit
address is grouped into four decimal numbers each representing 8 bits. The four decimal
numbers are separated by a decimal point. For example the 32 bit long address 1010 1011 0011
0110 0010 0000 0001 0101 is 171.54.32.21 in this format.
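
As a check on the notation, the conversion from a 32 bit binary address to the dotted format can be written out. This is a small sketch; the function name is an assumption for illustration.

```python
def to_dotted_decimal(bits):
    """Convert a 32 bit binary string (spaces allowed) to dotted decimal."""
    bits = bits.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    # Each group of 8 bits becomes one decimal number.
    return ".".join(str(int(bits[i:i + 8], 2)) for i in range(0, 32, 8))


print(to_dotted_decimal("1010 1011 0011 0110 0010 0000 0001 0101"))
```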

Using these conventions, the entry 171.54.32.0/24 refers to the range of addresses from
171.54.32.0 to 171.54.32.255. Hence, the first 24 bits are defined while the last 8 bits are "don't
care" bits. Another representation for the prefixes would be 171.54.32.X, where the X stands for
"don't care". The outgoing port is in the next column. An incoming address can match multiple
entries. In this case the entry with the longest prefix is chosen as the best match if a CIDR
algorithm is desired. For example the word 171.54.32.123 matches two entries from the table in
FIG. 1, namely 171.54.0.0/16 and 171.54.32.0/24. However, since 171.54.32.0/24 is a longer
prefix than 171.54.0.0/16, the best match is 171.54.32.0/24. Another method of establishing
priority would be to actually specify the priority for each entry.
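
The longest prefix rule in this example can be illustrated with a short software sketch. It uses a plain linear scan over the two prefixes named in the text, which is not how the invention or any CAM performs the search; it only shows which entry wins. The port numbers are hypothetical, since the contents of the port column of FIG. 1 are not reproduced here.

```python
from ipaddress import ip_address, ip_network

# The two prefixes named in the text; the port numbers are hypothetical.
TABLE = [("171.54.0.0/16", 1), ("171.54.32.0/24", 2)]


def longest_prefix_match(table, address):
    """Return (prefix, port) of the matching entry with the longest prefix."""
    target = ip_address(address)
    best = None
    for prefix, port in table:
        network = ip_network(prefix)
        if target in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, port)
    return None if best is None else (str(best[0]), best[1])


print(longest_prefix_match(TABLE, "171.54.32.123"))
```

Both entries match 171.54.32.123, and the /24 entry is returned because its prefix is longer.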

An alternate way to represent this table is shown in FIG. 2. Here each prefix is represented as a
range along a number line (shown at the bottom of the figure). Since we are dealing with 32 bit
prefix entries in this example, this number line extends from 0.0.0.0 to 255.255.255.255. Each
prefix is a contiguous range on this number line. The prefixes from the table in FIG. 1 are shown
on this number line. Note that the longer prefixes represent shorter ranges on this number line. If
a longest prefix match is desired, then the first range that matches the address to be looked up
going from top to bottom in FIG. 2 is the best match.
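
The range view of a prefix can be made concrete by computing the two end points of each prefix on the number line. The sketch below is illustrative only and the function name is an assumption; it also shows that a longer prefix covers a shorter range.

```python
from ipaddress import IPv4Address, ip_network


def prefix_range(prefix):
    """Return the (first, last) addresses covered by a prefix, as integers."""
    network = ip_network(prefix)
    return int(network.network_address), int(network.broadcast_address)


for prefix in ("171.54.0.0/16", "171.54.32.0/24"):
    first, last = prefix_range(prefix)
    print(prefix, IPv4Address(first), IPv4Address(last), last - first + 1)
```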

There are two general approaches to solving this problem. The first is to use a general CAM
(Content Addressable Memory) to store and search the entire lookup table. Each CAM cell
contains two memory elements to store three states (1, 0, or don't care) and comparison
circuitry to compare the destination IP address to the stored entry. This approach results in
large silicon area as well as large power consumption, as every entry is searched.
The second approach is to store the lookup table as some data structure in conventional memory.
For example see U.S. patent 6,011,795. This data structure is designed to allow efficient lookup
using a particular algorithm. A specially designed integrated circuit is used to perform the
lookup on this memory. While the power in this scheme can be low, it suffers from several
drawbacks. Any data structure involves a lot of wastage due to either empty entries or pointers
used to navigate the structure. The factor of real prefix data to memory used is 3-4 [...],
can be as bad as 64. Secondly, to run this lookup at a high speed, each level of this data structure
has to be pipelined. This puts a large I/O requirement on the system, which is difficult if not
impossible to meet as the number of lookups required exceeds 100 million lookups per second.
Hence current techniques are expensive and have unmanageable amounts of worst-case power
and I/O requirements. Another disadvantage is that the latency of these solutions can be large
and also the worst-case latency may be much larger than the average case latency. This large
and possibly uncertain search latency requires larger and more complex buffering of the data
packets.
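
The three-state matching of the first approach can be modeled in software as a stored value plus a mask of the bits that must match, with every entry compared on each search. The sketch below is an illustrative model under that assumption and not a description of any particular CAM circuit; the function names and sample entries are hypothetical.

```python
from ipaddress import IPv4Address


def ternary_match(value, care_mask, key):
    """True when key equals value on every bit where care_mask is 1."""
    return ((key ^ value) & care_mask) == 0


def make_entry(prefix_address, prefix_length):
    """Build a (value, care_mask) pair; bits past the prefix are don't care."""
    mask = (0xFFFFFFFF << (32 - prefix_length)) & 0xFFFFFFFF
    return int(IPv4Address(prefix_address)) & mask, mask


def search_all(entries, key):
    """Compare the key against every stored entry, as a general CAM does."""
    return [i for i, (value, mask) in enumerate(entries) if ternary_match(value, mask, key)]


entries = [make_entry("171.54.0.0", 16), make_entry("171.54.32.0", 24), make_entry("23.123.0.0", 16)]
print(search_all(entries, int(IPv4Address("171.54.32.123"))))
```

Because every entry takes part in every search, the work (and in hardware, the power) grows with the table size, which is the drawback noted above.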

Objects and Advantages

Accordingly, to address the deficiencies of prior art, several objects and advantages of the
present invention are:
♦ It does not suffer from the high power requirements of usual CAM implementations, allowing
the use of cheaper packaging and higher density, reducing the chip count. Power does not
scale with increasing table size, unlike conventional implementations.

♦ It allows the use of a binary CAM structure in place of a ternary CAM (which can store don't
cares), giving higher table entries per chip.

♦ It has low latency, which is beneficial to applications like real time voice and video
transmission.

♦ It can support a high lookup rate, allowing the routing of a large amount of traffic.

♦ It allows several chips to be operated in parallel with ease, to support large lookup table sizes,
as there is no communication required between chips to decide the best match.

Further objects and advantages are to have a solution which is easy to design. Still further
objects and advantages will become apparent from a consideration of the ensuing description
and drawings.

This method is easily extendable to solving other problems which require looking up an
entry in a table at a high rate.

Brief Summary of the Invention

This invention provides a method and system of an ASIC (Application Specific Integrated Circuit)
with several CAM arrays to perform a single-dimensional prefix search on the prefixes stored in
the said arrays such that as few as one CAM array is activated at a time. Each of these arrays is
surrounded by special logic that activates only the necessary CAM arrays.

Brief Description of the Several Views of the Drawing

List of Figures

FIG. 1 shows an example of a forwarding lookup table with one dimensional prefixes.

FIG. 2 shows the representation of a one dimensional lookup table as ranges along a number
line going from 0.0.0.0 to 255.255.255.255.

FIG. 3 shows the concept of dividing the lookup table into different subgroups depending on the
location along the number line.

FIG. 4 Chip-level architecture of the preferred embodiment

FIG. 5 Schematic of one possible implementation of comparing circuitry in one cell of the
Content Comparing Memory

FIG. 6 Schematic of another possible implementation of comparing circuitry in one cell of the
Content Comparing Memory

FIG. 7 Schematic of preferred embodiment of one complete cell of the Content Comparing
Memory

2.0 Architecture Description

Detailed Description of the Invention 

The basic idea behind this invention is to divide the table of prefixes into smaller subgroups. This
allows this invention to save on power and implementation area requirement as compared to
prior art. To aid in understanding this invention, first the method of dividing a large table of
prefixes into smaller subgroups will be described. Next the hardware to store, identify, and
search the correct subgroup will be described.
Basic Theoretical Concept 

Each entry in the lookup table consists of a prefix of a certain length. For example, a 32-bit
address with a prefix length of 16 is 23.123.0.0/16. This prefix can be thought to represent a
range of addresses from 23.123.0.0 to 23.123.255.255. In figure 1 each of these ranges is
represented by a square bracket facing upwards. Entries of the same prefix length are placed on
the same line.
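
The prefix-to-range view described above can be illustrated with a minimal Python sketch. It uses the standard-library ipaddress module; the helper name prefix_to_range is an assumption made for this illustration and does not come from the specification.

```python
import ipaddress


def prefix_to_range(prefix):
    """Return (lowest, highest) address covered by a prefix such as '23.123.0.0/16'.

    Hypothetical helper for illustration only.
    """
    net = ipaddress.ip_network(prefix)
    # The lowest address has all host bits 0, the highest has all host bits 1.
    return str(net.network_address), str(net.broadcast_address)


low, high = prefix_to_range("23.123.0.0/16")
print(low, "to", high)
```

For the example prefix in the text this yields the range 23.123.0.0 to 23.123.255.255.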

Searching entries takes power roughly proportional to the number of entries that need to
be searched. This scheme saves power by searching only a few entries out of the entire table.
The way the table is divided is shown in figure 1. Depending on the technology, chip size and
table size, several chips may be required to save and search the entire table. Hence, the first
division is between chips. Each chip contains entries from only a certain range of the address
space. Entries that cross this boundary are put in both chips. Thus, depending on the range in
which the address to be matched falls, only one chip needs to be searched for it.

Within each chip the entries are divided into several packs. These packs will be referred
to as banks. Each of these banks shares a mask entry, which stores the information on the
significant bits in the prefix. This allows the entries to be stored in smaller binary CAMs instead
of ternary CAMs, which are otherwise required. Since each bank contains entries of the same
length, the entries cannot overlap with each other. Thus, each address will get at the most one
match. This eliminates the need to have a priority encoder within each bank to resolve multiple
matches. For these reasons the second division is based on prefix length.

The third division is from bank to bank. Depending on the number of entries in each
prefix length on each chip, several banks may be required to store these entries. Each bank
contains entries contained in a particular address range. Each address lookup needs to only
activate one of these banks per prefix length, further reducing the power requirement. A priority
encoder is required between banks to determine which was the longest prefix match among the
matches from different prefix lengths.
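
The second and third divisions can be sketched in software as follows. This is only a behavioural model under stated assumptions, not the hardware described in the specification: the function names build_banks and lookup, the tiny bank size BANK_ROWS, and the use of Python dictionaries are all choices made for this illustration.

```python
import ipaddress
from collections import defaultdict

BANK_ROWS = 2  # assumed, deliberately tiny bank size for the example


def build_banks(prefixes, bank_rows=BANK_ROWS):
    """Group (prefix, tag) pairs by prefix length, sort them, and cut them into banks."""
    by_len = defaultdict(list)
    for prefix, tag in prefixes:
        net = ipaddress.ip_network(prefix)
        by_len[net.prefixlen].append((int(net.network_address), tag))
    banks = {}
    for plen, entries in by_len.items():
        entries.sort()  # sorted between banks; order inside a bank does not matter
        banks[plen] = [entries[i:i + bank_rows]
                       for i in range(0, len(entries), bank_rows)]
    return banks


def lookup(banks, addr):
    """Return (prefix_length, tag) of the longest match, or None if nothing matches."""
    a = int(ipaddress.ip_address(addr))
    best = None
    for plen, chain in banks.items():
        mask = ((1 << plen) - 1) << (32 - plen)  # one mask shared by the whole bank
        active = None
        for bank in chain:  # activate at most one bank per prefix length
            if bank[0][0] <= a:
                active = bank
            else:
                break
        if active is None:
            continue
        for value, tag in active:  # at most one entry of this length can match
            if a & mask == value and (best is None or plen > best[0]):
                best = (plen, tag)  # stands in for the priority encoder between banks
    return best
```

In this model, the loop over prefix lengths plays the role of the banks working in parallel, and keeping the longest matching length plays the role of the priority encoder between banks.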

Note that depending on the specific application, technology used and table size, the
number, order or type of this division can be changed to obtain the optimal design.

Figure 2: Schematic of chip used in the implementation of the lookup method

3.0 Functional Block Level Implementation

This section deals with the functional block level implementation, while the details of
circuit implementation are presented in the following sections. Figure 2 shows the schematic of
the implementation. Each thick solid line represents a flipflop. Thus, the regions between
flipflops of the same color lie in the same clock domain. The functioning of this schematic will
be explained by going through a lookup (of an address) and an add/delete prefix cycle.
Lookup: 

A particular interface is assumed here for the sake of discussion. In a cycle in which
there is an address to be looked up, the address is put on the in_addr bus, the packet ID is put on
the in_pkt_id bus and in_valid is asserted. Next this address has to go through the first check
to find out if it is in the same range as the addresses in this chip. This is the search on the first
division. This search is done by the use of CCM (Content Comparing Memory). In this
implementation, without loss of generality, the CCM compares the incoming data to that in
the memory and computes whether it is greater than or equal to the one in the memory. A possible
implementation of the CCM is presented in the next sections.

So, in the next cycle the incoming address is compared against two CCMs to check if it is
in the right range. The CCMs contain the maximum and minimum of the range of addresses
contained in that chip. Chips that do not have addresses in the right range do not have to do any
further work on this address, saving power. The chip that does match now passes on the address
to the CAM banks in the next cycle.

Now, as mentioned before, each of these CAM banks contains entries with the same prefix
length. This prefix length is encoded in the mask present in each bank. The data in the mask
decides which bits of the incoming address will be compared with the entries in the bank. Each
CAM bank also contains a CCM. This CCM stores the least possible address that will match the
entries in the bank and compares it with the incoming address. If the incoming address is found to
be greater than or equal to the data in the CCM but less than (i.e. not greater than or equal to) the
one in the next bank which contains addresses of the same prefix length, then and only then the
incoming address is passed to the rest of the CAM bank for comparison. This requires CAM
banks with prefixes of the same length to be placed next to each other and the addresses to be
sorted between the banks. Note that the addresses within a bank need not be sorted, as only one
match can be made for entries of the same prefix length. The last in a chain of CAM banks with
the same prefix length should not compare the incoming address with the next CAM bank (as that
contains prefixes of different length). This is achieved by introducing the last bit. So for the last
CAM bank in a chain (which has the last bit set) comparison is carried out only with one CCM.
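
The bank-enable rule described above can be modelled in a few lines of Python. This is a behavioural sketch under assumptions: the function names ge and bank_enables and the list-of-tuples representation of a chain are introduced here for illustration and are not part of the specification.

```python
def ge(addr, ccm_value):
    """Model of one CCM: true when the incoming address is >= the stored value."""
    return addr >= ccm_value


def bank_enables(addr, chain):
    """Compute which banks of one prefix length are activated for an address.

    chain is a list of (lowest_address, last_bit) tuples, one per bank, sorted by
    lowest_address; the final bank of the chain must have last_bit set.
    """
    enables = []
    for i, (low, last) in enumerate(chain):
        if last:
            # Last bank in the chain: compare against its own CCM only.
            enables.append(ge(addr, low))
        else:
            # Enabled only if addr >= this bank's CCM and NOT >= the next bank's CCM.
            next_low = chain[i + 1][0]
            enables.append(ge(addr, low) and not ge(addr, next_low))
    return enables
```

With chain = [(100, False), (200, False), (300, True)], the address 250 enables only the middle bank, so at most one bank per prefix length is ever activated.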

In the next cycle the comparison within each CAM bank that matched (at the most one
per prefix length) is carried out. The circuit operation and design of these CAM cells is detailed
in the following sections and hence will not be covered here. It is sufficient to say here that
each row of CAM cells (which contains one entry) has an associated memory row (e.g. SRAM)
containing the tag (which could be the port address that the packet needs to leave the router by).
If a match is found between the incoming address and one of the entries in the bank, the
corresponding tag is output and a hit line is asserted.

In the next cycle the priority encoder decides which of the CAM banks has got the
longest prefix match. Again, the workings of the priority encoder are explained in detail in the
following sections. The priority encoder decides the CAM bank with the highest priority and
lets it output its tag (which is the longest prefix match) onto the m bus.
Update: 

This section shall detail how the data structure is maintained. A processor that maintains
the update engine gives the update commands. To allow lookups to take place without being
held up by updates, each update command maintains the data structure intact. This requires all
the CCMs and CAMs at various levels to be updated in one pipelined operation (so as to leave
the data structure ready to do a lookup in the next cycle). This means that each update is one
clock cycle long and updates each section as it travels down the pipeline. The lookup operation
can resume after the clock cycle in which the update is introduced to the pipeline.

To add a new entry to a chip, the entry is placed on the in_addr bus, the
corresponding tag is placed on the in_port bus and packet_update is asserted. The bank
address that this update is directed to is put on the update_blk_addr bus, while the row number
within this bank is put on the update_cam_addr bus. Now, this addition might change the data
structure, so as to require the modification of the following CCMs:

• Bank CCM: If the entry is the smallest in that bank, the CCM content has to be updated. The
update_ccm signal is asserted, which ensures this. Note that the in_addr bus should contain the
smallest address that matches the new entry. The mask in the CAM bank will ensure that the
relevant bits are ignored during lookup.

• Lo_CCM: This contains the lowest address that can get a match on this chip. Thus, if the
incoming entry is the smallest in the chip, update_lo_ccm is asserted. Again the
in_addr bus should contain the smallest address that matches the new entry.

• Hi_CCM: This contains the highest address that can get a match on this chip. Thus, if the
incoming entry is the largest in the chip, update_hi_ccm is asserted. In this case the
in_addr bus should contain the largest address that matches the new entry. Note that this
update will never require the concurrent updating of the bank CCM. So, putting the largest
address on the in_addr bus will not cause a problem.
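
The three CCM update conditions above can be summarised in a short Python sketch. Everything here is an assumption made for illustration: the function update_signals, its arguments, and the dictionary keys are hypothetical, and the key names are only one reading of the control-signal names in the text, some of which are garbled in the source.

```python
import ipaddress


def update_signals(new_prefix, bank_low, chip_low, chip_high):
    """Decide which CCM updates an added entry would need (illustrative model only).

    bank_low, chip_low and chip_high are the integer values currently held in the
    bank CCM, Lo_CCM and Hi_CCM respectively.
    """
    net = ipaddress.ip_network(new_prefix)
    lo = int(net.network_address)    # smallest address that matches the new entry
    hi = int(net.broadcast_address)  # largest address that matches the new entry
    update_hi = hi > chip_high
    return {
        # Per the text, a Hi_CCM update never coincides with a bank CCM update,
        # so the largest address can be driven on in_addr in that case.
        "in_addr": hi if update_hi else lo,
        "update_ccm": lo < bank_low,
        "update_lo_ccm": lo < chip_low,
        "update_hi_ccm": update_hi,
    }
```

For example, adding 23.123.0.0/16 to a bank whose CCM currently holds 23.200.0.0 would assert only the bank CCM update and drive 23.123.0.0 on in_addr.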

A delete is similar to an add, except that the entry is set to a special value that will never
match a valid incoming address. In the design part we first came up with a compact circuit
implementation of the CCM cells. Then we tried to optimize the critical path delay to obtain the
maximum speed.

4.0 Architecture and Floorplan 

One of the issues with the architecture we chose was the update. At first glance it seems
that the update will be very time consuming and of the order of number of entries. But we
observed that although the IP entries should be sorted between the CAM banks, they do not have
to be sorted inside each CAM bank. This reduces the number of operations needed for each
update from O(N) (N = number of entries) to O(N/M), in which M is the number of rows in each
CAM bank. In our architecture we assumed M~100, so the update operation is on the order of
1000 operations. In rare cases when the boundary between prefix lengths should be moved
across the CAM banks, the number of operations for update may increase. By assigning enough
number of CAM banks to each prefix length based on statistics we try to avoid these kinds of
updates as much as possible.

The other issue with our architecture is that we have to have extra bins in each chip to
take care of the storage area which is wasted between prefix length boundaries. We need to have
25 extra bins (number of different prefix lengths, from 8 to 32) per chip. By looking at the floor
plan in fig. 3, the area penalty is only 10%. This area penalty will decrease as we go to larger
chips and smaller feature size technologies (less than 2% for a 1 cm² chip size in 0.15 µm
technology). Thus we obtain a very low power design, since for each lookup only one chip is
activated and in each chip at most 25 CAM banks are fired.

Figure 3: Floor Plan of the Chip [FIGURE DROPPED]

The floorplan as mentioned above is shown in fig. 3. Also the pipelining of the operations
is shown in fig. 2. All latches with the same color represent one operation cycle. The latency for
each chip (and for the whole lookup, since one chip per lookup is activated) is 6. The expansion
of the system by adding chips is also very simple and is only limited by the number of address
lines allocated for addressing the memory locations.

For estimating the area, we used the formula given in the project handout. For CAM cells
we assumed an area of [illegible], since we are using smaller transistors than the standard cell
and we could stack the vias. For SRAM cell we assumed an area of [illegible], since the SRAM
height is the same as the CAM cell height. This way each CAM bank will have an area of 5.7
Mλ². The logic for each bank requires 1.3 Mλ². Our Priority Encoders require 1 Mλ². Our
decoders require 60 Mλ². So the total chip area will be 0.74 cm². Since the aspect ratio of our
chip is 1.11:1, the chip will be 0.91 cm × 0.81 cm.

4.1 Functional Testing and Verilog Status

The objective behind writing the verilog code was:

• Test the soundness of the idea.

• Check for any potential architectural bottlenecks (like large number of wires running all over
the chip).

• Ensure the update can be implemented without impeding the lookup very much.

We have met these objectives. The verilog code was written right down to the functional block
level (i.e. just above the gate level). For example, we implemented each CAM block with all the
associated periphery logic. We could not figure out how to do iterative wiring and vectorized
instantiations in verilog, and since we wanted to implement the model with correct functionality
we took recourse to Perl to do the needful.

The controller was implemented mostly procedurally. While lookup was properly implemented,
update is much more difficult. So we just implemented a simple update algorithm that cannot
take care of spills to adjacent bins. However, we did write a Perl script that took the
lookup_table and created the initialization file with correct bin partitions.

Since we have a three hash structure we need to worry about three kinds of hash overflows. For
overflows in the address space hashes the update algorithm is O(Number of Bins ~ 2000); for
overflows in prefix length hashes the update algorithm is O(Number of entries ~ 200K). We
devised the architecture in such a way that any CAM and the affected CCMs can be written in
the same clock cycle without destroying the data structure integrity. Thus we can continue doing
lookups, using only the idle cycles to do the updates. For address space hash overflows this will
entail a penalty of O(1%), which is quite small and can be slipped in without holding up the lookup at
all. However, a prefix length hash overflow can in the worst case hold up lookup for
milliseconds. The saving grace is that these updates (i.e. changes in distribution of routing prefix
length) are likely to be very infrequent, O(months), and thus with clever algorithms we will keep this
penalty down to a minimum.

4.2 Setup of the circuit critical paths

From the figure of the overall chip operation in section 2.0, there are 6 cycles per chip.

The first and last cycles are I/O. The second and third cycles are CCM operations and the fourth
one is the CAM bank operation. The fifth one is the chip priority encoder. The CCM operations
and CAM operation are very similar to each other and both seem to be in the critical path. So
both of these circuits should be simulated. The priority encoder is a static logic evaluation and
should not be a problem in terms of timing, but since it is an important function of the chip, it
was simulated.

The placement of the wires for the units is shown in fig. 4.

5.0 Circuit Approach and Simulation Method

As we discussed in the previous sections, we have 2 basic Content Comparable
Memories: the usual CAM and the CCM. For the usual CAM we used the series matchline
structure (or the NAND match chain). As mentioned before, this was used to save power. Since
we have a large number of CCMs also (2018 of them), we had to come up with a compact
structure for this memory. We observed that for comparing our IP address with the CCM
content, we can subtract these 2 numbers and see if the result is a negative number or not. In
logic terms, this means that we have to 2's complement one of our numbers and add them
together. If the overall addition result is positive (i.e. the extra bit for 2's complement is 1) there
would be a carry generated, otherwise there would be no carry. We used a carry-chain
architecture to implement our CCM.

Figure 4: Placement of the wires for the unit [FIGURE DROPPED]

It is not desirable to do a 2's complement operation on the IP number for each lookup.
One solution is doing the 2's complement operation on the CCM content when it is stored during
an update. Another solution is storing the original CCM content, but doing the carry chain logic
operations on the inverse of the stored value. In this case there should be a carry input to the
carry chain. Since the 2's complement of a binary number is equal to the bitwise inverse of that number
plus 1, the end result will be the same as the first solution.

Effectively the CCM content is subtracted from the IP number each time and a carry is
generated if the IP number is Greater than or Equal to the CCM content, hence the name G.E.CAM.
Two possible implementations are shown in Fig. 5. Figs. (a) and (b) correspond to the first and
second solutions respectively. In both cases transistor M1 can be connected either to Vdd (done
in the Fig. 5(a) implementation) or to bitline (Fig. 5(b)). Connection to bitline may make the
overall cell size smaller. Of course there could be other implementations for generating the
inputs to the series and parallel carry chain transistors.
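
The second solution (operate on the inverse of the stored value and feed a carry input of 1) can be checked with a small bit-level Python model. This is a sketch of the arithmetic only, not of the transistor-level carry chain; the function name ge_carry_chain and the generate/propagate formulation are assumptions made for this illustration.

```python
def ge_carry_chain(ip, stored, width=32):
    """Return True when ip >= stored, using a ripple carry over ip + ~stored + 1."""
    carry = 1  # carry input supplies the "+1" of the 2's complement
    for i in range(width):  # ripple from LSB to MSB
        a = (ip >> i) & 1
        b = ((stored >> i) & 1) ^ 1      # bitwise inverse of the stored bit
        generate = a & b                 # this bit position creates a carry
        propagate = a | b                # this bit position passes a carry along
        carry = generate | (propagate & carry)
    # A carry out of the MSB means ip - stored did not go negative.
    return carry == 1
```

The final carry is 1 exactly when the incoming value is greater than or equal to the stored value, which is the G.E. result the text describes.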




Figure 5 - CCM carry chain, panels (a) and (b) [FIGURE DROPPED]



We used a one phase clocking scheme. Each section has one whole clock cycle to perform
its operation. Static logic sections (like the decoder and Priority Encoder) have the whole cycle
time to evaluate and so their timing is not critical. For precharge logic sections (like the CCM
and CAM), the clock low cycle is used for precharge and clock high is used for evaluation.

To save power our bitlines are not precharged (this will cut down the power by half).
With this scheme it seems that there is no guarantee that all the nodes in the matchline chain are
precharged. To eliminate the potential charge sharing problem, we send the bitline data during
the precharge period so that the matchline nodes which are going to be connected to the
matchline outputs are precharged correctly.

As was discussed in the previous section and also from the above paragraph, we have two
close critical paths, the CCM operation and the CAM bank evaluation. Both were simulated in
spice. For CCM, the whole path was simulated, from the output of the previous stage flip flop to
the input of the next stage flip flop. Since it was observed that the number of addresses with
more than 24 bit prefix lengths is very small (0.1% from statistics), we decided, without loss of
generality, to allocate a few bins for more than 24 bit prefix lengths and then use 24 bit CAMs
and CCMs for storing the rest of the IP numbers. Since the word length for all CCMs in this
implementation was 24 bit, we did the simulation for this word length.

For CAM, the simulation path was from the output of the previous stage to the input of the sense
amplifier (one row of data). The delay of the sense amp was then estimated and added to the
overall delay number. Since the longest word length of CAM cells is 32 bit, we did our
simulation for a 32 bit long word.

For inputs we assumed that they all have 100 ps rise and fall time (FO4 rise time in the TTSS
corner in which we are measuring the speed of the circuitry). For the output of each section we used
appropriate loading.

All the wire capacitances and resistances were included in the simulation. These were
extracted from the general wiring scheme discussed in previous sections. For capacitances the worst
value was considered, that is, it was assumed that the metal layer is sandwiched between top and
bottom metal layers. For all cases the worst case delay was simulated, i.e. it was assumed that the
signal should travel the whole length of wire, although for some loads it had to travel shorter
distances.

Since we simulated one complete row in CAM, for many of the logic circuits, like the
SRAM read write circuitry, the correct load and fanout are already there in the circuit. For signals
and controls that go to all the rows (like bitlines, enable signals and clock signals), the necessary
fanout was simulated by using dummy gates and capacitive loads.

For measuring the critical path delay, all the initial conditions were set to the opposite of
the final values during the cycle we are simulating. This way all the precharge and evaluation
times are taken into account correctly.

We also did the simulation for the write operation in the CAM banks (including the
decoding) and Priority Encoder, although from initial hand calculations it was obvious that the
timing of these circuits is not an issue.

5.1 Circuit Design and Critical Path Simulation
5.1.1 CCM Simulation

The circuit schematic for each cell is shown in Fig. 6. Since there is no read operation
from the CCM cell, there is no need to have more than minimum size NMOS's in the SRAM
cell. For the logical operations (bitline.D) and (bitline+D) transmission gate logics were used to
save area. The GE line (equivalent of matchline in normal CAM) transistors were chosen to be
16λ in size in order to decrease the delay. The GE lines are implemented with minimum width
(for lower capacitance) M1, since the distances are not that far. The wire capacitance and
resistance were also included in the unit cell. Wire capacitance was assumed to be Cgnd +
2*Cadj, because during the evaluation half cycle all the nearby lines are silent.



Figure 6 - CCM unit cell [FIGURE DROPPED]

The bitline circuitry is shown in Fig. 7. The bitline inverter was chosen to be minimum
size because the capacitance on this bitline is only equivalent to 62λ on bitline and [illegible] on
bitline_bar. Eight of these cells are connected to each other to form a CCM 8 bit block. The
wordline wire load for this 8 bit section is shown in Fig. 8.




Figure 7 - The CCM cell with bitline circuitry [FIGURE DROPPED]

Figure 8 - The wordline wire load for 8 bit section of CCM [FIGURE DROPPED]

Connecting all these 24 bits (3 8 bit sections) will result in a large delay (around 10 ns).
So matchline buffering is needed to reduce this time. Buffers are put between each 8 bit section,
with the total of 2 buffer sets (2 inverters for each set). The buffers between 2 sections are shown
in Fig. 9.




Figure 9 - 8-bit blocks with buffers between [FIGURE DROPPED]

Now the question is whether we should precharge our GE lines low and then charge them
high for a GE (Greater than or Equal to), or do it the other way around, precharge them
high and then discharge them when a GE occurs. If we had no intermediate buffers (only a final
buffer), then maybe precharging GE lines low and then charging them high makes sense. That
way the lines initially go up very fast (because transistors are on) but reach the final value very
slowly.

But this option is not good if we have buffers in between. In a test spice simulation, even
though Domino gates were used for buffering, the delay did not improve from the no buffer case
by that much. The problem is that even if the buffers are skewed in such a way that they switch
at a low threshold value, their input signal will never be very strong and so the buffer stages have
a long delay. So because of these concerns the lines were precharged high and then discharged
in case of a GE. The precharge transistor is also shown in Fig. 9. The precharge is done through
an NMOS transistor. The reason is that we do not want to charge our GE line more than Vdd-Vt.
Charging bit lines more than that will turn off the GE line transistors even more and will slow
down the GE line.

Each buffer set consists of two inverters. At first glance it seems that the first inverter
should be skewed such that it switches at a high threshold voltage. But the fact is that the GE line
never charges up more than Vdd-2Vt, which can be very low (around 2 volts for Vdd~3 volts). So
the first inverter should actually be skewed such that the threshold level is lower than a normal
inverter to provide enough noise margin. By doing a set of simulation sweeps, the Wpmos/Wnmos
ratio was chosen to be [illegible] to provide a switching point of around 1 volt (a noise margin of
about 1 volt). The second inverter was skewed the same way, but this time for speeding up the circuit.
The sizes were chosen to be W_NMOS = [illegible] and W_PMOS = [illegible] for both inverters. For
the inverter at the end of the GE line the same sizes were used. These were obtained by optimizing
the RC line delay model.




Figure 10 - The output logic for CCM [FIGURE DROPPED]

The output logic for CCM is shown in figure 10. The output of this CCM should be sent
to the logic of the previous CAM bank's CCM. The wire model is shown in the figure. The
output is buffered before sending it to the wire to minimize delay. The buffers are also sized for
best delay. The buffers in the other path are added intentionally to equalize the delay of both
paths as much as possible. If this is not done, then there is a possibility of creating glitches in the
enable output (when the GE line is discharged for both this CCM and the next one, but the next one
arrives later because of wire delay). These unwanted outputs actually slow down the circuit.
With the sizes given, there was no unnecessary transition on the Enable output. The margin was so
much that even if one of the paths becomes slightly faster, the Enable glitch will be very small.

As we see in Fig. 10 there is a 'last' signal in the logic. If the 'last' signal is 1 in the next
CAM bank, then the enable output will only depend on the GE from the current CCM. This is
explained in the architecture section. This 'last' signal is stored in the MSB of maskbit, since the 8
MSB's are not used in masking (there are no prefix lengths less than 8).

The clock circuitry is shown in Fig. 11. The wire loading is also included in the model
(for the wire which should go through the height of the CCM cell).

Figure 11 - Clock circuitry for CCM [FIGURE DROPPED]

The delay was measured for the worst case GE operation, i.e. the carry is generated in the
LSB and should propagate all the way to the end. The delay from input to enable
output was measured to be 2.61 ns. This was measured in the TTSS corner.

Since the threshold voltage of the first inverter in each buffer set was set to be low (by
weakening the PMOS) we checked the SPSS corner. The circuit worked properly.

The input capacitance for clock is equivalent to 18λ. The input capacitance for bitline is
[illegible].

5.1.2 Normal CAM Simulation

Many issues in CAM design are similar to the CCM's issues and so are already
discussed. The basic CAM cell is shown in Fig. 12. As explained before we have both 32 bit and
24 bit CAM banks. Since the 32 bit one is more critical, we will simulate this one.

Figure 12 - Basic CAM cell [FIGURE DROPPED]

The matchline wire model is the same as the GE line in CCM. Choosing the NMOS
transistor size as minimum size ([illegible]) helps minimize the area, especially since the CAM cell is
covering most of the chip area. The matchline transistor and the transmission gates are also
chosen to be [illegible]. These are made as small as possible to meet the 7 ns cycle time objective. The
loading of the other 99 rows on the bitline and also the bitline wire capacitance and resistance are
modeled as shown in Fig. 13. For wire capacitance still the formula Ctotal = Cgnd + 2*Cadj was
used (worst case assuming M2 below and above). Since the two bitlines are switching in
opposite directions, it seems that Ctotal = Cgnd + 4*Cadj should be used for wire capacitance.
But since these bitlines are around 25λ away from each other, Cadj is pretty small and can be
ignored.

Figure 13 - Modeling the loading on the bitlines [FIGURE DROPPED]



The bitline input circuitry is shown in Fig. 14. The mask implementation is also shown in this
figure. Only 24 bits of CAM have this implementation and for the other 8 (MSB), the mask bit is
always 1. The mask bits are stored in an extra row of SRAM cells. When the 'mask' bit is 0, then
both bitlines will be charged to high. This way the input of the matchline transistor will be
always high as desired for the masked bits.

Figure 14 - CAM bitline input circuitry [FIGURE DROPPED]

For designing the drivers we did not want to minimize the delay of the drivers, because
they are operating during the precharge time and so their timing is not critical. Our only
constraints were having an input capacitance of 300λ (the acceptable output capacitance for the
FF we designed in HW #1) and also performing the required logic. We also tried to equalize the
delay of the two paths. The delay of the bitline path is almost 7.1 FO4 and the bitline_bar is
around 7.8 FO4 ([illegible]). This delay time is acceptable considering our cycle time.

For driving the CAM bitline switches, the circuit shown in Fig. 15 is used. When the Update
signal is high it means that we want to write into the CAM, and so the switches are turned on.
Likewise, when the Enable signal is high it means we want to do a lookup operation, and so again
the bitline switch is turned on. When the Enable signal is low we do not write anything to the
bitline, to save power. The output of this logic sees the capacitance of a wire the length of the
CAM bank width. The delays of the 2 paths are equalized. The 'Swcam' delay is around 3.7 FO4 and
the 'Swbcam' delay is around 3.9 FO4. The input capacitances were assumed to be 100λ for both
the 'Update' and 'Enable' signals, to be within the range of the 200λ load for the FF.




Figure 15 - CAM bitline switches circuitry [FIGURE DROPPED]



For speeding up the CAM operation, we divided the 32 bit CAM into two 16 bit CAM blocks
and then NORed the output results. Since our SRAM cells are shorter in height than the CAM
cells, there is enough space below each SRAM row for an extra pitch of wire. So, as shown
before, the two 16 bit CAM banks are placed on the two sides of the SRAM bank and the output of
one of them is passed below the SRAM cells to the other side. The output logic of the CAM
blocks is shown in Fig. 16. Each 16 bit block is also divided into two 8 bit blocks, and buffers
are used between these 2 blocks, the same way as in the CCM design.

Figure 16 - CAM output logic [FIGURE DROPPED]



Since we do not want to activate SRAM wordlines when the CAM bank is not active, we
included the 'enable_bar' signal in the NOR operation. This way when 'enable' = high, the circuit
operates, and when 'enable' = low, the SRAM wordline remains low. Also, the wordline of the
CAM is NORed in the second NOR gate for the write operation in the SRAM during an update.

For the first NOR, the sizes were chosen so that the threshold voltage is low enough to
avoid the problem discussed in the previous section. The apparent effective PMOS to NMOS ratio
here is 12/(4/3) = 9, which is more than the 5 that was assumed for inverter buffer gates. The
reason is that when, for example, GE1 is still high (around 2 volts), 'enable_bar' and GE2 can
both be low, which turns on the 2 PMOS transistors and effectively decreases the PMOS chain
resistance.
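
The gating described above (enable_bar folded into the first NOR, the CAM wordline folded into the second NOR) can be written as a small truth-table model. This is a sketch under assumptions that the text does not state explicitly: the GE lines are taken to be low on a match, and an inverting buffer is assumed between the second NOR and the SRAM wordline. The function names are hypothetical.

```python
def nor(*inputs):
    """N-input NOR on 0/1 values."""
    return int(not any(inputs))


def sram_wordline(ge1, ge2, enable, cam_wordline):
    enable_bar = 1 - enable
    # First NOR: high only when both GE lines are low (assumed match
    # polarity) and the bank is enabled.
    match = nor(ge1, ge2, enable_bar)
    # Second NOR: also responds to the CAM wordline so that the SRAM
    # row is selected during an update.
    wordline_bar = nor(match, cam_wordline)
    # Assumed inverting buffer that drives the SRAM wordline.
    return 1 - wordline_bar
```

In this model a disabled bank can never raise the SRAM wordline through the lookup path, while an update still selects the row through the CAM wordline.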

Since the logical effort of the first NOR is very high (around 8), the next NOR gate is
chosen to be minimum size. The overall LE from the input of the second NOR to the wordline is
around 55/20*5/3 = 4.6. The gates are sized to get minimum delay. The overall delay of the logic
is estimated to be around 5 FO4, of which more than 3 FO4 is from the first NAND.

In Fig. 16 the logic for generating 'enable_bar' is also shown. At the input, the clk signal is
NANDed with the 'enable' signal so that when clk is low (precharge state), the 'enable_bar' signal
is kept high and so the SRAM wordline is not activated erroneously. The delay of this chain is not
critical as long as it is shorter than the delay of the matchline. The worst case wire loading
(the full height of the CAM bank) and also the fanout of all the skewed NOR gates are also
modeled. This delay is around 3 FO4, which is less than the delay of the GE lines, and so it is
ok.

Figure 17 - Basic SRAM cell [FIGURE DROPPED]



The basic SRAM cell is shown in Fig. 17. The NMOS size of the cell (8λ) was chosen
such that during a read operation, the low voltage on the SRAM output does not increase by more
than 500mV. This gives enough margin such that even if the pass gate transistor becomes
stronger due to mismatch and process variations, the read operation is still done without
destroying the SRAM content.

The modeling of the bitline wire load and also the loading of the other rows is done the
same way as for the CAM cell and is shown in Fig. 18. Precharge transistors are added for the
read operation.

Figure 18 - SRAM bitline wire and load from other rows [FIGURE DROPPED]



The SRAM read and write circuitry is shown in Fig. 19.

Figure 19 - SRAM read and write circuitry [FIGURE DROPPED]



For the read circuitry, we have 2 switches which open the bitlines to the input of the Sense
Amplifier (not shown). The 30fF capacitors model the sense amp input capacitance.

For designing the write drivers, again as with the CAM write circuitry, we did not have to
minimize the delay. Since the write circuitry is static, it has one whole cycle to complete its
operation. Our only constraint was the bitline input capacitance (60λ in this case). The delay of
the fork is equalized. The delay of bitline is around 3.7 FO4 and that of bitline_bar around
3 J3 FO4 from hand calculations.

The SRAM read and write switches should be driven. The driving circuits are shown in
Fig. 20. One is driven by 'Enable' during a read operation and the other one is driven by
'Update' for a write operation. The outputs of these logic circuits see the capacitance of a wire
the length of the SRAM bank width (plus the input capacitances of the switch transistors). The
delays of the 2 paths are equalized. The 'Swram' delay is around 2.9 FO4 and the 'Swbram' delay
is around 2J FO4. For 'Swramw' and 'Swbramw' this delay is almost 1.9 FO4.




Figure 20 - SRAM read and write switches driving [FIGURE DROPPED]



To further save power during the period in which the CAM bank is inactive (not enabled),
the clock is also shut off when 'enable' is low. The circuit shown in Fig. 21 is used both for
driving the clock load and for shutting off the clock. When 'enable' is low, 'clk' is kept high
and 'clkb' is kept low, and so there will not be any precharge.




Figure 21 - Clock drive circuitry [FIGURE DROPPED]

The output of 'clk' sees the capacitance of a wire the length of the SRAM bank width
and the SRAM bitline precharge transistors. The delay for the 'clk' signal is around 4 FO4.

The 'clkb' signal output sees the load from a wire which runs the full height of the CAM
bank. It also sees 400 precharge transistors, out of which 4 transistors are present in the
circuit. The other 396 are modeled by a dummy transistor load. The capacitive load of this dummy
transistor is chosen to be very small because in the actual case also only one of the matchlines
should be precharged and the others should be at their high value. The delay for the 'clkb'
signal is around 4 FO4.

The delay of the lookup was measured by one of these 2 methods: 1. The delay from the
rising edge of the clock to the rising of the SRAM wordline was measured and then 0.65ns was
added for the Sense Amp read operation, or 2. The delay from the rising edge of the clock to the
point where the SRAM bitline output changes by 250mV was measured. (These numbers, the 0.65ns
delay and the 250mV, were both extracted from EE313 notes.) Both methods resulted in the same
delay value. The overall evaluation delay of the lookup operation is around 3.5 ns from the clock
edge (in the TTSS corner). BUT this delay from the falling edge of the 'clkb' signal is only
2.65ns. So if the FF for this stage is driven by this delayed clock, we can borrow some time from
the next clock cycle, as shown in Fig. 22. Since the next operation is the Priority Encoder and
that operation is fully static, it can afford to lend some time to the previous operation.

Figure 22 - Cycle borrowing (waveforms: Original Clock, CAM Clock) [FIGURE DROPPED]

The 2.65ns operation allows us to operate at a 7ns cycle time (considering 1.5 FO4 clock
skew, 2 FO4 FF delay and ^5V t» cycle time).

The input capacitances for this stage are as follows: Clock = 196λ ~ 78fF,
Enable = 250λ ~ 100fF, Update = 235λ ~ 94fF.
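
The three width/capacitance pairs listed above are consistent with a single conversion factor of about 0.4 fF per λ of gate width. That factor is inferred here from the listed numbers and is not stated in the report, so the following check is only a sketch under that assumption. The names are hypothetical.

```python
# Assumed conversion factor, inferred from the listed pairs
# (for example 250 lambda ~ 100 fF); not given explicitly in the report.
FF_PER_LAMBDA = 0.4


def input_cap_ff(width_lambda):
    """Approximate input capacitance in fF for a gate width in lambda."""
    return width_lambda * FF_PER_LAMBDA


STAGE_INPUTS_LAMBDA = {"Clock": 196, "Enable": 250, "Update": 235}
STAGE_INPUTS_FF = {
    name: round(input_cap_ff(width))
    for name, width in STAGE_INPUTS_LAMBDA.items()
}
```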

5.1.3 Decoder design and Memory write timing

The decoding is done in 2 cycles. First an 8 bit decoding is done to choose one of the 256
banks on the chip (the structure is shown earlier in the report). In the next cycle, a 7 bit
decoding is done to choose the row of the CAM-SRAM bank. In the same cycle the data is written
in the CAM and SRAM. So this second decoding is more critical than the first one. The 7 bit
address space is divided into a 3 bit and a 4 bit address space. So we predecode on 3 bits, then
predecode on 4 bits, and then combine them. The 3 bit predecode is faster than the 4 bit one, so
the circuit for the 4 bit predecode is designed. The circuit is shown in Fig. 23.




Figure 23 - 4:16 Decoder [FIGURE DROPPED]

First a 2:4 decoding is done. Then a 4:16 decoding is performed. The 16 resulting lines
are then run through the whole height of the CAM-SRAM bank to do the 16*8 = 128 decoding.
The equivalent wire capacitance is 262fF, which is used in the simulation. After the final
decoding stage, the global wordline is driven. The capacitance of this line (almost 1cm long) is
around 2pF. The resistance of this line is around 125 ohms, which was ignored. Considering this
resistance in our sizing calculations causes the buffer stage driving the global wordline to
become very large, which is unnecessary, since the decoding time is not that critical. The wire
resistance adds around 2.3 FO4 delay to the total delay of 10 FO4, which is well within the
acceptable range. The simulated decoding time is 2ns and the overall write time is measured as
2.53ns. So the writing operation is not a problem and can be done well within the intended cycle
time of 7ns. (Remember that both the decoding and write operations are static operations and so
can be done during the whole cycle.)
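
The row decoding described above (two 2:4 decoders forming the 4:16 predecode, a 3 bit predecode, and 16*8 = 128 combining gates) can be modeled functionally as follows. This is a behavioral sketch only. Which address bits feed the 3 bit and the 4 bit predecoder is an assumption, and the function names are hypothetical.

```python
def decode_2to4(two_bits):
    """One-hot 2:4 decode."""
    return [int(two_bits == i) for i in range(4)]


def predecode_4to16(four_bits):
    # A 4:16 predecode built by ANDing the outputs of two 2:4 decoders.
    high = decode_2to4((four_bits >> 2) & 0x3)
    low = decode_2to4(four_bits & 0x3)
    return [h & l for h in high for l in low]


def predecode_3to8(three_bits):
    """One-hot 3:8 predecode."""
    return [int(three_bits == i) for i in range(8)]


def decode_row(addr7):
    # Assumed split: upper 3 bits to the 3:8 predecoder,
    # lower 4 bits to the 4:16 predecoder.
    pre8 = predecode_3to8((addr7 >> 4) & 0x7)
    pre16 = predecode_4to16(addr7 & 0xF)
    # Final stage: 8 * 16 = 128 AND gates, one per row.
    return [a & b for a in pre8 for b in pre16]
```

Exactly one of the 128 outputs is high for any 7 bit address, and with the assumed bit split its index equals the address.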

Since the write operation depends on the strength of the cell PMOS and the pass gate
NMOS (NMOS is stronger), in the SFSS corner the write operation may fail. This corner was checked
and the write operation was done correctly.

The address input capacitance is 9.6fF.

5.1.4 Priority Encoder 




Figure 24 - Priority Encoder Tree Structure [FIGURE DROPPED]

Our priority encoder encodes priority among the CAM banks. Since these are distributed around
the chip, there are likely to be very long wires somewhere in the decoder. Following the principle
of putting the wire capacitances as far along the gate chain as possible, the gates are nearly
always dominated by wire capacitances (except for certain gates with huge fanouts). Also, since
we have only one priority encoder, area is not really an issue. So we went for a distributed
static tree PE. This also allowed this pipeline stage to lend some time to the CAM search stage,
increasing our lookup speed further. For simplicity of design as well as for power considerations,
we went for two-stage 16x16 (256) decoders as shown in the floorplan. Between the two
decoders we had 10 AND gate stages, so the available time was divided equally between those
gates. Taking a 7ns cycle time, giving 0.8ns to the CAM stage and accounting for flipflop delays,
clock skew and uneven clock duty cycle, we have to meet a total delay spec of 5.1ns. Since we
wanted to be comfortably inside this spec without wasting area and power on large fast gates,
we chose 3.5ns as the target.
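
As a functional illustration of the two-stage 16x16 arrangement and of the equal division of the 3.5ns target over the 10 gate stages, consider the sketch below. It models behavior only, not the distributed static tree. The priority order (lowest index wins) is an assumption, and the function names are hypothetical.

```python
TARGET_DELAY_NS = 3.5
GATE_STAGES = 10


def per_stage_budget_ns(target_ns=TARGET_DELAY_NS, stages=GATE_STAGES):
    """Equal share of the target delay for each gate stage."""
    return target_ns / stages


def first_set(bits):
    """Index of the first nonzero entry, or None if all are zero."""
    return next((i for i, b in enumerate(bits) if b), None)


def priority_encode_256(match):
    """Two-level 16x16 priority encode of 256 bank match signals."""
    if len(match) != 256:
        raise ValueError("expected 256 match signals")
    groups = [match[g * 16:(g + 1) * 16] for g in range(16)]
    # First level: which group of 16 contains a match.
    group = first_set([int(any(grp)) for grp in groups])
    if group is None:
        return None
    # Second level: which bank inside the winning group.
    return group * 16 + first_set(groups[group])
```

With these numbers each stage receives a budget of 0.35ns.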

A typical stage (AND gate) in the PE looks as shown in figure 25.



To get a total delay of 350ps (* lOii ., ) per stage, writing the delay equations we get a fanout
of 9.1 across the whole gate. The wire capacitances and gate load were calculated from the figure
by tracing the positioning of each gate relative to the layout on the chip. The critical paths are
shown in figure 24. The green path is the critical path for the first stage and the red is that
for the second stage. Simulations at the TTSS corner gave a delay of 3.8 ns including all the wire
capacitances, meeting the spec comfortably.



The main sources of power consumption are as follows:

1 - Bitlines of CCM and CAM (and SRAM's) and other lines inside the cell.

2 - Global wires inside the chip, mainly the IP number bus. In this category there are also
global word lines, but they are only active during the update. Since update frequency
is a very small fraction of lookup frequency, we can ignore this power.

3 - Power consumed in the logic, like in the Priority Encoder and CAM logic. The same as
above, we can ignore the power burnt in the decoder.




Figure 25 - Typical stage in the PE (AND gate) [FIGURE DROPPED]



6.2 Power Estimate

We simulated the CAM's, CCM and Priority Encoders at the in l corner and integrated
the current through Vdd to obtain the power. Here are the results:

1 - For the 32 bit, 100 entry CAM bank, we obtained the average energy per cycle to be 0.15
nJ and the average energy per cycle when all the bitlines switch to be 0.83 nJ. So
0.68nJ is due to bitlines. Since at any time half the bitlines switch on average, the
bitlines consume about 0.34nJ per cycle. So the average consumption per cycle for a 32
bit CAM bank is 0.49nJ. At each cycle, 25 CAM banks in only one chip are
activated. So the total energy consumed in CAM banks (bitlines and logic) will be
12.25 nJ per cycle.

2 - For the 32 bit CCM, the energy consumed per cycle is 24 pJ for all bitlines switching. So
the average energy consumed is almost half of this value, which is 12 pJ. Since in
each cycle 256 of these CCM's activate in one chip, they will consume 3.1nJ of
energy per cycle.

3 - From simulation, the Priority Encoder was consuming 0.16nJ per cycle. This was for the
critical path of the PE. Assuming a factor of 2 for the whole encoder and considering
that we have 17 of these PE's on each chip, the overall energy consumption by the PE
will be 5.4nJ per cycle.

4 - Now we should calculate the energy for charging up the IP number buses inside the
activated chip. From the chip floor plan we have 5 of these buses. These buses are 1cm
long and so each line has a capacitance of 2pF. So the total capacitance is 320 pF.
Since on average half of these lines switch, the energy per cycle will be 1.75 nJ. If a
match is found, we only activate one port address bus. Since the port address bus is
only 5 bit long, its contribution to energy consumption is negligible.
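
The four contributions itemized above can be tallied, and the total converted to power at a given cycle time, with a short script. This is a sketch that only reproduces the arithmetic. It assumes a 3.3 V supply and a C*V^2 energy per switching bus line (both inferred from the 1.75 nJ figure, not stated in the report), and it assumes 8 chips. The CCM term uses the 12 pJ average directly. All names are hypothetical.

```python
CAM_BANKS_ACTIVE = 25
CCMS_ACTIVE = 256
PES_PER_CHIP = 17
CHIPS = 8  # assumed number of chips sharing the total power


def cam_energy_nj():
    logic_nj = 0.15         # average energy per bank per cycle
    all_bitlines_nj = 0.83  # energy when every bitline switches
    # On average only half of the bitlines switch.
    per_bank = logic_nj + (all_bitlines_nj - logic_nj) / 2
    return CAM_BANKS_ACTIVE * per_bank


def ccm_energy_nj():
    return CCMS_ACTIVE * 0.012  # 12 pJ average per CCM


def pe_energy_nj():
    return 0.16 * 2 * PES_PER_CHIP  # factor of 2 for the whole encoder


def bus_energy_nj(vdd=3.3):
    total_cap_pf = 320.0  # 5 buses, 2 pF per line
    # Half of the lines switch; assumed energy of C * Vdd^2 per line.
    return 0.5 * total_cap_pf * 1e-12 * vdd ** 2 * 1e9


def total_energy_nj():
    return (cam_energy_nj() + ccm_energy_nj()
            + pe_energy_nj() + bus_energy_nj())


def power_per_chip_mw(cycle_ns):
    # nJ per ns is W; convert to mW and divide among the chips.
    return total_energy_nj() / cycle_ns * 1000 / CHIPS
```

Under these assumptions the total comes to about 22.5 nJ per cycle, which is roughly 141 mW per chip at 50 MHz (a 20 ns cycle) and roughly 400 mW per chip at a 7 ns cycle.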
If we sum up these numbers, the overall energy consumption (for all chips) will
be 22.5nJ per cycle. For a 50 MHz clock frequency (20 ns cycle time) this is equivalent to 1.125
W overall, or 1125/8 ≈ 141mW per chip. For a 7ns cycle time (142MHz clock), this power is
400mW per chip.

6.0 Concluding Remarks

When we set out to decide the overall architecture of our design we made the following
observations:

1 - Hardware implementations invariably had poor algorithmic design and hence burnt a lot of
power.

2 - Software (or processor based) designs, while having a good algorithmic design, suffered from
the memory-processor bus bottleneck and were hence serial in nature, and also consumed a
lot of area by storing inefficient data structures.

Since we started with virgin silicon it did not make sense to take either of these approaches,
but rather to come up with an efficient parallel design combining the best of both worlds.
Overall we think we hit all the three key pins: smallest area, lowest power and highest speed,
without going in any extreme direction. Also, it should be remembered that our hashing is
totally flexible, without depending on any IP distribution statistics in the address space.

7.0 Generalizations and Extensions

Our hardware based search and pre-classification ideas are not limited to the
implementation described in this report. Throughout the report we pointed out some
generalizations. The following list explains a few more of these generalizations and extensions.
Description and Operation of Alternative Embodiments.

1 - We are not limited to SRAM for implementing our CAM and CCM cells. Any kind of
memory cell, including DRAM, can be used as the storage element (circuitry/device).

2 - Any kind of matchline implementation can be used for the CAMs. In the current
implementation we used series transistors in our CAM matchline (NAND like
matchline). A parallel transistor implementation of the matchline (NOR like matchline) or
any combination of the two may be used as well.

3 - In this implementation CCMs were used for doing the 'greater than or equal to'
operation. In general CCMs can be used for any comparison operation.

4 - The word length for CAMs and CCMs does not have to be 32 bits. The same ideas
explained in this report work for any arbitrary word length.

5 - The CAM bank size can be chosen arbitrarily (in this implementation it was 100).

6 - We had 3 levels of pre-classification in this implementation, out of which 2
were in the address space (Fig. 1). The number of levels of pre-classification is not
central to our idea and can be chosen as appropriate for the particular application.

7 - We are not limited to the single phase clocking scheme, as was used in this
implementation.

8 - By providing multiple matchlines for each storage element in our CAMs, we can
perform several lookups in parallel and further speed up our search.